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Abstract 

We present a comparative analysis of large-scale topological and evolutionary properties of transcription 
networks in three species, the two distant bacteria E. coli and B. subtilis, and the yeast S. cerevisiae. The 
study focuses on the global aspects of feedback and hierarchy in transcriptional regulatory pathways. While 
confirming that gene duplication has a significant impact on the shaping of all the analyzed transcription 
networks, our results point to distinct trends between the bacteria, where time constraints in the transcription 
of downstream genes might be important in shaping the hierarchical structure of the network, and yeast, 
which seems able to sustain a higher wiring complexity, that includes the more feedback, intricate hierarchy, 
and the combinatorial use of heterodimers made of duplicate transcription factors. 
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I. INTRODUCTION 



Cells need constant sensing of environmental changes and internal fluxes, and correct response 
to these external and internal stimuli through the simultaneous expression of a large set of genes. 
The basal mechanism that performs this task is transcriptional regulation. Depending on species, 
context and specific function, this can involve simple interactions or complex signaling cascades, 
but in general it involves a large number of genes . For this reason, it is nec- 

essary to characterize this regulatory process from a global, or "network" point of view. To this 
aim, transcriptional regulation networks are defined starting from the basic functional elements 
of transcription [7]. This information is represented as a directed graph, usually identifying each 
gene transcript and their protein products with an unique node, and each regulatory interaction 
with a directed edge A — ► B between the node B (the target gene) and the node A (the gene 
coding for a transcription factor (TF) that has at least one binding site in the cis -regulatory region 
of B). A transcription factor regulating its own expression is called an autoregulator (AR). With 
this definition, the interaction graph structure is given by large-scale and collections of small-scale 



experiments [8|,|9L 
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A basic way to understand the architecture of transcription networks it to consider their topol- 
ogy. Topological analysis is able to capture functional properties, and important architectural 



features of the network 



us is able to capture n 
113 



180. Examples of topological properties are the 



so-called network motifs [8], or the degree distribution of the connectivity of the nodes Q19Q. These 
observations need to be compared with suitable null models. For example, "network motifs" are 
subgraphs for which the probability P of appearing in a suitably randomized network an equal 
or greater number of times than in the empirical network is lower than a given cutoff value. For 
a meaningful comparison, randomized networks are taken with the same single-node characteris- 
tics of the empirical network. Usually, each node in the randomized networks is constrained to 
have the same number of incoming and outgoing edges (degree distribution) as the corresponding 
node has in the empirical network [2(1 21]. Considering more large-scale properties, the known 
transcription networks possess a hierarchical feedforward layered structure [|15L I17Q . and often 
feedback is mainly limited to a rather large set of autoregulations Il22ll . The perspective on the 
topology transcription networks is enriched by taking an evolutionary point of view. Evolution of 
a transcription network is driven by three main biological mechanisms: (i) gene duplication, (ii) 
rewiring of edges by mutation/selection of TF/DNA interactions and (iii) horizontal gene transfer. 
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The first mechanism has been shown to play a substantial role, although the extent to which it can 



shape the network is debated 



24 



25 



tifs do not emerge from duplication events O, 



from gene duplications 



. For example, it has been shown that network mo- 



2511 . while other topological properties have arisen 



. We have previously considered from this viewpoint the properties 
of hierarchy and feedback in the E. coli network Ill8ll . finding that gene duplication can be held 
responsible for the preservation of self-regulations, and for the "shallow" layered organization, 
which one can hypothesize to optimize the time constraints for the production of targets. 

In this paper, we present a comparative study of transcription networks, which extends the 
analysis on E. coli, and considers also the evolutionary distant bacteria B. subtilis, and two dif- 
ferent data sets for the eukaryote S. cerevisiae. From comparison of these data we distinguish 
between unifying and distinct features of the three networks. In particular, while in the two bacte- 
ria feedback loops involve few nodes and the number of hierarchical layers is minimal, yeast shows 
more feedback and a more complex hierarchy of transcription factors. Autoregulatory interactions 
are always abundant, although their role appears to be different between yeast and the bacteria. 
The fraction of self-regulators is above 50% in the bacterial data sets, dropping around 10% for 
S. cerevisiae. If we take into account the effects of evolution on topology, a richer, more complex 
scenario appears. Our analysis is focused on the mechanism of gene duplication, and is based on 
a network growth model which considers this drive to be the only one present. With this method 
it is also possible to infer on other mechanisms indirectly. Our main findings are the following, (i) 
Duplications play an important role in evolution of the TN: the relative abundance of simple net- 
work subgraphs stemming from duplication is evident in all data sets, (ii) Gene duplications shape 
the degree sequences and the hierarchy of the network, as predicted by a duplication-divergence 
model, (iii) The feedback core in the yeast network may be shaped by duplications of existing 
feedbacks (iv) The yeast network tends to form heterodimeric TF pairs from duplicates, which 
seem to be forbidden in E. coli, possibly because of the same selective pressure erasing crosstalks 
from duplicate ARs. 



II. FEEDBACK AND HIERARCHY. TOPOLOGICAL EVALUATION. 

Feedback is present in our network if closed directed paths exist. We quantify the amount of 
feedback with the "leaf-removal" algorithm. This decimation algorithm iteratively removes the 



input and output tree-like components (i.e. parts without loops) of a directed graph ||27|1 . The 



3 



outcome can be either the empty graph or a non-empty subgraph of the original. In the former 
case, the whole network is tree-like. In the latter, we say that the leftover subgraph represents the 
feedback core of the network. The size of this core (number of nodes or edges, with respect to the 
complete network) can be used as a measure of the amount of feedback. 

The reverse, or complementary, of feedback is hierarchy. Neglecting self-regulations, we de- 
fine roots the nodes which are not regulated by other nodes, and leaves the nodes which do not 
regulate other nodes. Starting from the roots of the graph we can define hierarchical layers con- 
sidering the relative ordering of nodes in a chain of regulatory interactions, (i) If we consider a 
tree-like graph (or a tree-like subset of it), each iteration of the leaf -removal algorithm removes a 
number of nodes, either roots or leaves, and the edges pointing from or to these nodes. By defi- 
nition, the removed nodes do not interact with each other, and can be thought to form a distinct 
computational "layer". In this case, the leaf-removal algorithm described above naturally defines 
a hierarchy between the nodes, through the longest path to a root or to the feedback core upstream 
of a given node. In the general case, we define the number of layers in the hierarchy by the number 
of iterations of the leaf -removal algorithm needed to remove all the nodes outside of the feedback 
core (thus as a sum of layers downstream and upstream of the feedback), (ii) Hierarchical layers 
can be defined in a similar way through the longest open chain of regulators that each of their 
members share. In a longest-path hierarchy, and in absence of feedback, members of layer one 
are regulated by at most themselves. Members of layer two are regulated by a chain of one, and 
no more, nodes and possibly themselves, and so on. This definition, though conceptually similar 
to that given by the leaf removal, is computationally demanding since the algorithm represents a 
NP-complete problem. We did not use this definition in our analysis, (iii) An alternative definition 
(computationally easier) is given by the shortest open directed paths between nodes. The number 
of layers is computed considering the longest among the shortest paths from any pairs of nodes 
(which can be found in polynomial time, for example with the Dijkstra algorithm [28]). Short- 
est paths measure the minimal number of intermediary transcriptional interactions required for a 
signal from a transcription factor to reach a given target downstream, which can be interpreted as 
hierarchical layers. On the other hand, a straightforward uni vocal definition of hierarchical layers 
using shortest paths is difficult to produce Ill7ll . 

In order to quantify the significance of feedback and hierarchical properties of the three tran- 
scription networks (five data-sets), we compared them with random ensembles of networks having 
the same degree sequences, i.e. conserving the number of incoming and outgoing edges for each 
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FIG. 1: Feedback and hierarchical layers for the three networks of E.coli (RegulonDB 5.5) B. Subtilis 
(DBTBS) and S. Cerevisiae (Balaji data set), compared to randomized instances. The data on E. coli are 
compatible with older data sets. The two data sets on yeast do not give compatible results; data refer to most 
recent one. Top: Number of nodes Nc in the feedback core. Middle: Number of leaf-removal layers for the 
empirical networks and randomizations having the same or lower Nc as the empirical ones (to improve the 
size of the sample, we show Nc < 70 for the case of yeast, however, the results do not change for lower 
thresholds). Releasing the constraint on Nc gives a weaker signal for the bacteria. Bottom: Number of 
layers in a shortest-path hierarchy, without constraints on Nc for the randomized cases. A constraint for 
the empirical Nc gives a stronger signal. 



node. For all the bacterial data sets, we find consistently the following. The feedback core is 
slightly smaller than the typical random case, and the number of layers is minimal. This is true 
both for a longest path and a shortest path hierarchy. Data are shown in the first two columns of 
Fig. [Q On the other hand, the case of S. cerevisiae is more complex. First, we observe that the 
two data sets we analyzed are not consistent with each other. We believe this to be due to their 
large differences in size: the older Guelzim data set Il29ll contains approximately 800 interactions, 
while the more recent Balaji data set l!26ll contains around 13000. Considering the Guelzim data 
set, the almost tree-like topology does not differ from randomizations in any of the observables 
considered above. For the Balaji data set, we find contrasting trends compared to the bacterial 
networks. The rather large feedback core involving 60 nodes and 207 interactions is far larger than 
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the one found in the two bacterial networks. On the other hand, it is significantly smaller than the 
typical feedback core found in randomizations. Interestingly, the number of leaf-removal layers 
falls in the average, in contrast with the trend observed in bacterial networks. Even more surpris- 
ingly, the number of shortest-path layers in the yeast network greatly exceeds randomizations, in 
strong contrast with the bacterial data- sets. Data are shown in the third column of Fig. [TJ More in 
detail (Supplementary Fig. [fl the lengths of these shortest-paths have a Poisson-like distribution 
in both the empirical graphs and their randomizations, but for the yeast network the tails of their 
distributions are significantly shifted between the two. 

A possible rationale for these observations is the importance, in bacteria, of minimization of 
the time-scales for the production of target genes, for example with structural connotation. Since 
each transcriptional step of an expression program takes an estimated time of the order of one cell 



cycle BOh . The same argument might also explain the abundance of self -repressed transcription 



factors, which reach steady-state expression in shorter times yOfl. This kind of pressure is possibly 



released from yeast due to a more efficient organization of gene expression and degradation. 



III. EVOLUTIONARY DUPLICATION-DIVERGENCE GROWTH 



Model. We use a simple duplication-divergence model for the growth of the transcription 



network dl8 



3lL 



321] to guide the data analysis. The model is formulated as follows. At each step 
all, or a fraction of the nodes are duplicated. The duplicate nodes inherit all the in- and outgoing 
edges of the original (ancestral) node. In other words one supposes that before divergence, the 
binding sites on the proteins and the regulatory regions on DNA remain identical. Subsequently, 
the new edges are removed with a certain probability, that might depend on the status of the node 
(regulator or target, subject or regulating a new or an old gene, etc.). 

A qualitative study of the model leads to the following schematic results Jilt. First, by defini- 
tion there is no possibility of de novo addition of edges by rewiring. Hence no edges can be created 
from an homology class that initially does not regulate another one. In the simplest case, if feed- 
back is initially absent, it cannot arise spontaneously. Moreover, no hierarchical layer can be added 
to the network, and duplicates selected for fixation will lie in the same layer. On the contrary, in the 
presence of feedback, and ARs in particular, the model behaves differently. Duplication of ARs 
can give rise to higher order feedback and to new ARs. However, this feedback will be strictly con- 
fined among members of the same homology class. Analogously, the degree distribution is piloted 



6 



by the degree distribution of duplicates. The in- and out- degree distributions can be decoupled by 
choosing different removal probabilities for old/new and new/old edges. One can obtain power- 
law out-degree and compact in-degree distributions that are in close relation with the phylogenetic 



conservation of the network proteins 02l . 



3311 : conserved nodes must exhibit at least one between 



the in- and out-degree sequence with a scale-free distribution, while proteins with only exponential 
degrees cannot be phylogenetically conserved under a general duplication-divergence model. Ac- 
cording to this model, bacterial transcription networks with scale-free out-degrees and exponential 
in-degrees are thus consistent with a phylogenetically conserved set of transcription factors and a 
non-phylogenetically conserved set of target genes (possibly due to abundant horizontal tranfers 
of target genes). 

In a realistic situation, the above observations may not apply because of the possibility of 
edge rewiring, that may be able to shuffle the hierarchy, create new feedback, affect the degree 
distribution, and mix edges among homology classes. Note, however, that some regulatory edges 
between apparently different homology classes may have also resulted from very old edges within 
ancestral homology classes that are no longer classified as single homology classes due to their 
gradual divergence. Besides, while actual edge rewiring may have occurred after homology class 
separation, it is clear that edge deletion must typically happen after gene duplication (or else extant 
regulatory networks would be densely cross-edged graphs.) Hence, this model, which focuses on 
the necessary deletion of duplicated edges, can assess the specific role of gene duplication and edge 
deletion in the growth of the network and also underline their possible shortcomings to delineate 
the role of other evolutionary processes such as edge rewiring. 

Data Evaluation. Having in mind the above qualitative results we examine the topological 
roles in empirical networks of nodes coming from the same common ancestor. We define proteins 
that are likely to share a common ancestor through structural domain architectures 11141 134| . 35]. 
These domains allow for the definition of larger classes than sequence comparison alone [14]. 
The database enables to associate an ordered sequence of domains, or "domain architecture" to 
each protein. We define protein homologs as proteins whose domain architectures are identical 
neglecting domain repeats. This corresponds to a conservative view of homology where no new 
domains are acquired or lost after duplication. The results are tested using different definitions of 
homologs Pol . Homology allows us to construct sets, or classes, of proteins which have then sup- 
posedly evolved from a common ancestral gene. We have analyzed the distribution of regulatory 
edges between and within classes of the likely duplicate genes. The statistical significance of the 
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analysis in terms of homology classes is established [7] by comparison with random shuffling of 
genes (TFs and TGs separately) between classes. This analysis is limited by the number of nodes 
for which homology classes can be constructed, and thus less sensitive to the particular data set. 

In all cases we find that the motifs of duplicated TGs, regulated by a single or duplicate TFs, 
and duplicate TFs regulating a common target (Fig. O, are significantly overrepresented [14], 
which can be seen as a validation of the model. The signal for this is smaller in the smaller data 
sets (such as B. subtilis), because of poor statistics. 

Degree sequences and duplications. A first question to address is whether gene duplications 
are able to pilot the observed degree sequences of the network, as predicted by the model. For 
the case of yeast, one can perform this test on the above defined homology classes, and also, on 
the doubly conserved genes in the yeast whole-genome duplication proved by Kellis et al. [37]. 
Precisely, one can measure the degree distributions restricted to targets or regulator falling in the 
same homology class, and verify whether it scales in a similar way as the unrestricted distribution. 

Our results, considering the Balaji data set for yeast and RegulonDB 5.5 for the E. coli, are 
shown in Fig.|2] The homologs follow a qualitatively similar distribution to the entire network, ex- 
ponential for the in-degree, and power-law-like for the out-degree. For the in-degree distributions 
we find a better agreement between the decay of the entire network and the restriction to homology 
classes. This confirms the general idea that the degree distributions are strongly dependent upon 
duplications. However, in the case of yeast, the out-degree of the entire network seems to follow 
a broader distribution than that of the duplicates (top-left panel of Fig. |2]). This can be seen as 
an indication that other processes, such as rewiring, concur in defining the targets of a TF. The 
behavior of the duplicates from the whole-genome duplication is conditioned by the small size of 
the sample, and its degree distribution drops with a faster decay for both the in- and the out-degree. 

Duplication of Autoregulatory Circuits. Our analysis shows that the role of autoregulatory 
interactions is rather different in the bacterial data sets compared to yeast. The fraction of self- 
regulated TFs is above 50% in all three bacterial data sets, while being around 10% only in the 
two S. cerevisiae data sets. If we measure directly the distribution of ARs in classes, in the E. coli 
datasets we find a consistent signal that the population of ARs in homology classes is more dense 



and more variable than in randomized instances [18]. This is a direct evidence of duplication and 
conservation of ARs, and agrees with the conservation of a hierarchical structure during evolution- 
ary growth. In B. subtilis and yeast, we find that the population of ARs in classes falls closer to 
(although systematically above) the average of the randomized samples. This means that, in these 
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FIG. 2: In- and out-degree distributions of the entire S. cerevisiae (Balaji data set) and E. coli (RDB 5.5), 
compared to those obtained from homolog TGs and TFs. Homology is assessed with domain-architecture 
homology classes, and in the yeast, through doubly conserved genes in whole genome duplication. The in- 
degree shows good agreement (right panel) with the prediction of the duplication model. For the out-degree 
distribution, instead, data for the S. cerevisiae (top left) for the whole network seems to be wider than the 
homology classes, suggesting that processes other than duplication concur in shaping the out-connectivity 
of transcription factors. Some curves have been smoothed out by averaging nearby bins in order to enhance 
the visibility of the tails. 



cases, the pure analysis of AR population in homology classes is not sufficient to assess that ARs 
in the same class are the result of duplication and inheritance of self-edges. In the smaller data-sets 
(such as DBTBS |38|]) this absence of signal could be likely due to small size of the sample, so it 
is possible that this is another trend differentiating bacteria from eukaryotes. 

On the other hand, in all the data sets we find evidence for the relevance of AR duplications if 
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|(i) (ii) (iii) (iv) (v) (vi) 

FIG. 3: Top: scheme of the possible circuits stemming from the duplication of a transcriptional self- 
regulation after partial inheritance of interactions. Bottom: network subgraphs compatible with AR dupli- 
cation and inheritance of crosstalks within homology classes. The circles represent homologous nodes. 



a 


Empirical 


Randomization 


Z-score 


E. coli (RDB 5.5) 


0.23 


0.10 ±0.082 


+1.6 


B. subtilis 


0.03 


0.0042 ± 0.0073 


+3.55 


S. cerevisiae (Balaji) 


0.13 


0.020 ± 0.039 


+2.79 



TABLE I: Subgraphs compatible with AR duplications, containing crosstalks (Fig. [3]>. The observable a 
measures the ratio between subgraphs and edges in the network, and is defined in the text. 

we consider the distribution of the subgraphs that can stem from AR duplication (Fig. [3]) within 
homology classes (Table HJ). These subgraphs are compatible with AR duplication conserving 
crosstalks. In order to quantitatively evaluate the relevance of the subgraphs, we define an observ- 
able that counts the occurrence of the following events 

1. An AR regulates a homologous, non-self-regulatory transcription factor. 

2. An AR is regulated by a homologous, non- self -regulatory transcription factor. 

3. Two homolog ARs possess reciprocal cross regulation. 

We indicate with N c the total number of homology classes, with Mj the total number of the sub- 
graphs of the three kinds found in homology class i, and by L, the number of edges in the class, 
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and define 

1 ^ Mi 
a = — > — 
N c 4-f Li 



This observable represents the ratio of subgraphs to the total number of (self- and non-self-) edges 
present in the homology class. 

We find that, although the fixation of cross-talks within homology classes is typically less fre- 
quent that the retention of self-edges in bacteria [18], the AR duplication subgraphs are systemat- 
ically overrepresented, as shown by the positive signal. Specifically (see Supplementary Table [B, 
in E. coli and B. subtilis, this signal is dominated by duplicated self-interactions followed by loss 
of crosstalks, while for the S. cerevisiae network, subgraphs including crosstalks are more abun- 
dant, compatibly with the observed more complex hierarchy ||18fl . In addition, in the significantly 
overrepresented AR duplication subgraphs (type ii in figure [3]), only one of the two possible self- 
regulations of duplicate TFs survives, indicating that ARs do not proliferate by duplication. This 
observation indicates that duplications have been important to shape the non- self -regulatory edges 
within homology classes, thereby redefining the TF hierarchy, besides leading to the fixation of 
new ARs in the bacterial networks. This is especially the case in yeast, indicating that circuits 
with crosstalks or feedback stemming from AR duplication are not negligible, and might have 
contributed to the build-up of the existing feedback. 

Interaction Across Homology Classes and Evolution of the Feedback Core. To gain further 
insight into this point, we evaluated the distribution of interactions within and across homology 
classes. Considering the transcription factors homology classes, we constructed a "collapsed" 
weighted network as follows. The nodes are homology classes, and weighted oriented edges 
represent the number of members of a class regulating members of another class. In absence of 
rewiring, this would be a likely "ancestral" TF-TF network, inherited by duplication of "primitive" 
self- and heterologous edges. Our model dictates that this ancestral graph can only be hierarchical 
or maintain the same amount of feedback of the initial configuration. In particular, if an ancestral 
edge is sent out from a class to a second one and the reverse is not true, there will be no edges 
coming from the second class to the first in the evolved network. This corresponds to an asymmet- 
ric regulatory control of the first homology class over the second homology class. Conversely, the 
appearance of symmetric regulations may be a signature of rewiring, if one assumes that retaining 
both ancestral crosstalks produced by the duplication of an ancestral AR was then as unlikely as 
it appears to be between recent AR duplicates. Hence, symmetric regulation between homology 
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R 


Empirical 


Randomization 


Z-score 


E. coli (RDB 5.5) 


9.87 


7.71 ±3.65 


±0.59 


B. subtilis 


3.32 


3.46 ± 2.64 


« 


S. cerevisiae (Balaji) 


15.63 


10.72 ±2.05 


±2.39 



TABLE II: Rewiring parameter R for the TF-TF networks. For the bacterial datasets this observable falls 
in the typical case of a randomized instance. In the case of 5. cerevisiae there is a significant trend which 
indicates absence of rewiring (positive Z score), despite of the larger degree of feedback. This indicates 
that the feedback circuits tend to lie within homology classes. The parameter R quantifies the strength of 
asymmetric regulations between homology classes and is defined in the text. 

classes in the extant network actually suggests that the likely initial asymmetry conserved by du- 
plication was progressively smoothed down by edge rewiring or other evolutionary mechanisms 
with an equivalent effect. 

We define the observable R, which represents a measure of the strength of asymmetric regu- 
lation across the classes. We indicate as Lij the number of regulatory edges between homology 
class i and class j. Let then C; represent the number of elements in homology class i. Then we 
define 

^ _ I Lij — Lji | 

where the sum is performed over all the pairs of homology classes i and j, without considering 
self-interactions of classes. The normalization factor in the denominator is chosen to be a linear 
function of the number of edges. The linear relation is suggested by the experimental observation 
that the number of edges in these TF-TF networks is comparable to the number of nodes. We 
introduced the square root normalization factor in order to compensate for the bias generated by 
the different size of classes i and j [1]. 

Note that the definition the collapsed network absorbs the presence of crosstalks or feedback 
likely inherited by AR duplication, and this is reflected by R. To understand this, let us assume that 
the probability of a rewiring event is smaller than the probability of keeping inherited interactions. 
Then, homology classes, generated by duplication of ancestral genes, would tend to maintain their 

[1] The important fact here is that the normalization factor should compare to the number of nodes, as does the total 
number of edges in the network, alternative choices such as d + Cj or sqrt(CiCj) lead to the same results. 
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inherited regulations, thus measuring an high value of R. In other words, R measures the extent 
to which feedback is retained within homology classes. This likely comes from AR duplication, 
or duplication of existing higher-order feedbacks between homologs. 

To sum up, the larger is R (compared to randomizations), the less likely rewiring has shaped 
the network, and the more the hierarchy between the homology classes is maintained. Evaluation 
of empirical data (Table HQ) shows that in the bacterial data sets, the estimated rewiring is con- 
sistent with, or slightly lower than, typical randomized values. Perhaps unexpectedly, we find an 
even stronger signal for the yeast TF-TF network, where, as we have shown, the feedback is more 
common. Most of this feedback involves homolog TFs, and the network is likely to have been 
shaped by a small amount of rewiring, or by rearrangements that keep into account the homol- 
ogy relations of transcription factors. Note that, empirically, one alternative possibility is that a 
duplication of ARs with inherited crosstalk is followed by a subsequent split the homology class, 
so that the two homologs cease to be classified as such. This phenomenon would lead to a false 
positive for rewiring in our interpretation. However, this problem is less relevant in presence of a 
negative signal, such as the one we find. Another way to put the same observation, is that most of 
the feedback observed in yeast was shaped by duplication with retained crosstalks, rather than by 
rewiring. 

Duplication of Homodimeric and Heterodimeric TFs. An issue where we found notable evo- 
lutionary differences between the evolutionary transcriptional architecture of E. coli and S. cere- 
visiae is the duplication of homo- and heterodimeric TFs[2]. 

It is known from sparse observations that yeast tends to use more heterodimeric TF pairs than 
prokaryotes [39, 40, 41]. A systematic analysis using large protein interaction datasets confirms 



this trend. In E. coli (out of 150 TFs) we find 21 homodimers, and only 5 heterodimer pairs, all 
formed with the histone-like protein HU, which has a rather special status. Though no systematic 
data is available, the common opinion is that it likely that these figures strongly underestimate 
the number of homodimers. A computational evaluation of the TFs available in the 3DComplex 



database |40|] is not able to enrich the sample. On the other hand, it shows that 37/42 TF entries 
in PDB are scored as homodimers, corroborating the common belief. Conversely, in yeast, where 
more systematic data is available, we find (out of 157 TFs) 45 homodimers, and 91 heterodimer 
pairs. 



[2] This analysis was not possible for B. subtilis due to the lack of large-scale interaction data. 
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Co-occurrence 


Empirical 


Randomization 


Z-score 




E. coli 





0.46 ± 1.22 


+0.37 


A 












S. cerevisiae 


84 


32.38 ± 18.95 


+2.72 




E. coli 





0.213 ±0.45 


+0.47 


B 












S. cerevisiae 


15 


4.86 ± 2.29 


+4.43 



TABLE III: A. Homodimer Heterodimer TFs co-occurrence in classes measured by the product of the 
number of homodimers and heterodimers in each class, summed over classes. B. Homologous Heterodimer 
TFs in the same class. 

Inspection of the homology classes (Table ITJI1 and Supplementary Figure) shows that in S. cere- 
visiae, heterodimeric TFs tend to cluster in homology classes, and also to co-occur with homod- 
imers. We interpret this as a strong indication that heterodimers stem from duplications of other 
dimeric TFs. This is measured by evaluating the overrepresentation of number of homodimers and 
heterodimers in homology classes, and the product of homodimers and heterodimers in the same 
class. The same process seems to be forbidden in E. coli, possibly because of the same selective 
pressure erasing crosstalks from duplicate ARs. Due to insufficient data, we could not show that 
homodimeric TFs in his bacterium are likely to form classes of homodimers. 

Longest Path and Shortest Path Hierarchy. Finally, we proceed to evaluate the long and short 
path hierarchy conservation in evolution. The model predicts that they will be conserved, in ab- 
sence of rewiring and multiple AR duplications. In particular, we expect to find many members 
of the same homology class falling in the same computational layer, defined by longest paths to 
a root or shortest paths. Our observations are reported in Table HVl In E. Coli both data sets in- 
dicate a strong tendency to have duplicates in the same layer, with both definitions of hierarchy. 
In B. subtilis we find a similar trend only for the long-path hierarchy (the signal is much weaker, 
possibly due to the small size of the sample). Finally, in yeast we also find evidence for hierar- 
chy conservation all data sets. In the larger dataset, the long -path hierarchy was evaluated for the 
hierarchical component only. 

IV. DISCUSSION AND CONCLUSIONS 

Our results highlight both common and distinct trends in the two bacteria and the yeast, both in 
the topology and in its evolution. We find that the underrepresentation of feedback is common in 
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Path 


Data Set 


Empirical 


Randomization 


Z-score 




E. coli (Shen-Orr) 


844 


687.37 ±48.18 


+3.25 




E. coli (RDB 5.5) 


1171 


420 ± 34.7 


+21.6 


A - Long 


B. subtilis 


377 


353.43 ± 22.81 


+1.03 




S. cerevisiae (Guelzim) 


290 


260 ± 24.7 


+1.21 




S. cerevisiae (Balaji / TF) 


299 


262.88 ± 20.80 


1.74 




E. coli (Shen-Orr) 


1422 


1367.3 ± 15.79 


+3.5 




E. coli (RDB 5.5) 


1193 


453.69 ± 40.88 


+18 


B - Short 


B. subtilis 


676 


654.04 ± 27.15 


+0.19 




S. cerevisiae (Guelzim) 


365 


296.35 ± 25.76 


+2.67 




S. cerevisiae (Balaji / TF) 


572 


543.97 ± 18.25 


+1.54 



TABLE IV: Hierarchy and gene duplications. Preservation of the layer structure, measured by the number 
of homolog pairs occupying the same layer. (A) Longest path. (B) Shortest path (data for the Balaji data-set 
refer to the TF-TF interaction subnetwork). 



the three transcription networks. Indeed, the feedback core is always smaller than what expected 
from the null model, in both the two bacteria and yeast. On the other hand, there is a difference 
in the size of the feedback core of the network, which is 6 times larger in yeast, involving a major 
fraction of the transcription factors. As noted by Jeong and Berman [14211 . this feedback essen- 
tially condensed in one single connected component, and is enriched with TF nodes having the 
endogenous functions of cell cycle and sporulation, and the exogenous functions of diauxic shift 
and DNA repair. However, it must be noted that this feedback component, though large, is much 
smaller than expected by a degree- sequence-conserving null network model, so that the trend of 
underrepresentation of feedback has to be regarded as a common feature of the networks, and is 
particularly strong in yeast. On the other hand, there is a remarkable difference in the hierarchi- 
cal organization between the two evolutionarily distant bacteria and yeast, which emerges more 
prominently in the length of shortest paths in the network. In the bacterial networks, the nodes 
are organized in a small number of hierarchical layers pointing to a minimization of both longest 
and shortest paths between TFs and their targets. Since each computational layer has a cost in 
time [30], this observation would be consistent with a possible constraint on the minimization of 
the time-scales for the production of structural target genes from upstream TFs which sense a par- 
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ticular internal or external signal. In yeast, shortest paths are much longer, both in absolute terms 
and compared to random networks. This means that the number of computations made after an 
upstream TF is activated by a signal, so that the hypothetic time constraints should be softer. This 
feature could be ascribed to the documented more complex post-transcriptional regulation and the 
efficient specific degradation machinery. We also would like to observe that, while the distribu- 
tions of small network motifs seem to be common among known transcription networks [12], the 
nonlocal observables considered here, which evaluate the feedback and hierarchichal organization 
of transcription programs, point to consistent differences in network architecture that accompany 
the transition from prokaryotes to eukaryotes. 

From the evolutionary analysis, we find in all cases significant indications that the existing 
feedback stems from lower-order feedback by gene duplication and inheritance of interactions. 
In particular, AR duplication is significant for all data sets. However, circuits with crosstalks or 
feedback stemming from AR duplication are found in yeast and to a certain extent in B. subtilis, 
but less in E. Coli. On the contrary, we observe that crosstalk conservation in E. coli is not 
frequent. Direct measurement of ARs in E. coli show a consistent signal that the population of 
ARs in homology classes is more dense and more variable than in randomized instances tl8fl. This 
is a direct evidence of duplication and conservation of ARs during evolution. In B. subtilis and 
yeast, the population of ARs falls closer to the null model. The pure analysis of AR population in 
homology classes is not sufficient to assess that ARs evolve from duplication and inheritance of 
self-edges. 

The other face of the medal is that outside of the small feedback cores, a hierarchical orga- 
nization where computational layers are built by gene duplication is visible, more in detail, in 
E. Coli and S. cerevisiae duplicates tend to populate the same layer, indicating conservation of 
hierarchy. The same seems to be true for B. subtilis, but the signal is weaker. A possible interpre- 
tation is again that, in bacteria, minimization constraints on the time-scales for the production of 
target genes, if present, would translate into selective pressure for the reduction of the number of 
computational layers. On the other hand, such pressure could be released from yeast due to more 
efficient post-transcriptional control of gene expression. 

Finally, and along the same lines, we find a strong distinct trend in E. coli and yeast, concern- 
ing the use and evolution of homodimers versus heterodimer transcription factors. In S. cerevisiae 
we found strong indication that heterodimers TFs stem from duplications of ancestral dimeric 
TFs. The same process seems to be forbidden in E. coli, possibly because of the same selec- 
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tive pressure erasing crosstalks from duplicate ARs. One can speculate that this ability to make 
use of heterodimeric binding boosts the combinatorial capacity of a promoter signal integration 
function [43]. 

In conclusion, we presented a comparative analysis of large-scale topological and evolutionary 
properties of transcription networks in three species, focusing on the global aspects of feedback 
and hierarchy in regulatory pathways. This analysis confirms that gene duplication is an important 
drive for the shaping of transcription networks, which follows distinct directions between bacteria, 
where we hypothesize time constraints to impose the observed simple hierarchical structure, and 
yeast, where more intricate pathways arise. Overall, it appears that yeast is able to sustain a 
higher complexity in its topological structure, including more feedback and longer pathways, and 
to explore more freely the possible regulatory interactions stemming from gene duplication, such 
as feedback stemming from self-interactions and dimer transcription factors. Our current work 
explores the hypothesis that part of this higher complexity stems from the known whole-genome 



duplication event |44 , 



45] 



METHODS 

Graph Growth Model. A simple model of network evolution through duplication-divergence 
was considered. At each time step all the nodes of the graph are duplicated, while the number of 
edges rises fourfold. This happens for the following reason: for each edge connecting two original 
(old) nodes (old-old edge), duplication of interaction gives rise to edges between the two old nodes 
and the two duplicate nodes. The original old-old edge therefore generates the four old-old, old- 
new, new-old, new-new edges. Duplication of the graph is followed by erasing of edges with 



prescribed probabilities 118 



3 111 . One can formulate the model with partial or global duplications, 
and including or not the duplication and removal of self-edges (in this case, we considered the 
probability of retaining a self-edge equal to that of any other edge). The behavior of this model 
was compared, through a set of observables, with the observed trends of the experimental data. 

Data Sets. We considered the following data sets for the transcription networks. For E. coli, 
the Shen-orr data-set and the larger and more recent RegulonDB5.5 !§[ |^, l^]. For B. subtilis, 
DBTBS Jisl]. For S. cerevisiae, the Guelzim [29] data-set and the more recent Balaji [26] data set. 
The Balaji data set was modified to include auto regulating interactions taken from the literature. 
In the case of yeast, probably due to the much larger size of the more recent data, there is no 
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compatibility between the two sets of data. Domain architecture data are taken from the SUPER- 

nn n 

FAMILY database H471. 14811. versions 1.61 and 1.69, as in the data sets in [70. Homodimers and 
Heterodimers forS. cerevisiae and E. coli respectively, were obtained from the SGD database I149n 
and from ref. I 5pll . 

Evaluation of Feedback and Hierarchy. We used the leaf-removal algorithm [18] on the data- 
sets (including ARs) and their randomized counterparts. This algorithm prunes the input and 
output tree-like components of a directed network leaving with a "feedback core" of nodes, where 
each node is involved in at least one feedback loop. Each iteration of this pruning algorithm de- 
fines a hierarchical layer. The main observables we considered were the size of the core and the 
number of iterations to reach the core. For the evaluation of shortest paths, we used the Dijkstra 
algorithm 112 811 . considering the distribution of shortest-path lengths and the longest paths. The 
results for the empirical networks were compared to randomizations obtained using a standard 
Markov Chain Monte Carlo (MCMC) algorithm J5I] or an Importance Sampling Monte Carlo 
algorithm j2l|l that preserve the degree sequence (marginals of the adjacency matrix). We con- 
sidered random counterparts of the networks where the only constraints come from the in- and 
out- degree sequences. In particular, we chose not to conserve the number of self-regulators. Our 
previous work II18L 12111 shows that in general this null model yields qualitatively different results. 
In particular, in the case of the Shen-Orr data set the number of layers falls in the average of the 
random ensemble that conserves self -regulations. For the scopes of this work, we did not consider 
this kind of randomization, and we hypothesized that, as in the model for the graph growth by 
duplication, auto-regulatory edges have the same status of other edges. 

Evaluation of Duplications. We used domain architectures from the SUPERFAMILY 
database to build protein architecture databases, one for each specie. We then constructed classes 
of homologous genes using similarity criteria between these architectures, as was done in [14]. 
Two genes are considered homologs if they share the same domains in the same order, neglecting 
domain repeats. For this analysis, proteins coded by the same operon were considered as separate 
entities. Since the definition of homology is rather arbitrary it is rather natural to test different 
definition and observe the stability of results. All these results were filtered for consistency using 
stricter or looser homology criteria Ii36ll . In the case of the S. cerevisiae, and only with respect to 
the network degree distributions (as shown in Fig. |2]), we also considered the notion of homology 
descending from the gene pairs defined by blocks of doubly conserved syntheny described in Q3V{] . 



18 



Duplication of Auto Regulators. We assessed the evolution of the Auto Regulated transcrip- 
tion factors by studying the distribution of the subgraphs shown in Fig. [3] within homology classes. 
Quantitative measure was given by the evaluation of observable a, as described in the main text, 
which was computed on both the experimental data set and on the randomized instances of the 
null model described later. 

Interaction and Rewiring between TFs. We studied the distribution of regulating interaction 
within and across homology classes. A number of observables were implemented to describe the 
relationship standing between the architecture data set and the transcription network data set. To 
quantify the effect of rewiring we introduced the observable R as described in main text. Again, 
this observable was evaluated upon the experimental data set and upon the null model. 

Evaluation of Coevolution ofDimer TFs. We studied the evolutionary properties of homod- 
imers and heterodimers with respect to our duplication and divergence model. We assessed the 
co-existence of homo- and hetero- dimers within the homology classes by evaluating the sum 
(over all the homology classes) of the number of all dimeric pairs found in each homology class. 
The same analysis were performed upon the randomized instances of the null model, and the result 
is shown in Fig. [2] 

Null Model for Duplication. Most of the measures performed upon the experimental data sets 
were compared with a null model of homology. Keeping fixed the homology classes, we randomly 
shuffled the architecture associations to the gene names. This randomizes all the interaction exist- 
ing between homology properties (which are responsible of class generation) and all the other data 
sets (adjacency list of transcription network, homodimers, heterodimer pairs). The advantage of 
this null model lies in the fact that no experimental database is actually randomized, only their in- 
teractions. This is important as this randomization does not destroy the homology information nor 
the global (large scale) properties of the network. Finally, gene name shuffling was done separately 
for TFs and TGs, as was done in II 1411 . due to their inherently different DNA-binding properties 
(which depend on their domains). The data shown in this paper correspond to 10 5 randomized 
instances of the null model, allowing us to estimate P-values larger than 1 x 10~ 5 . 
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Supplementary Figure 1: Histogram of shortest-path lengths in the yeast TF network (Balaji data set), 
compared to a typical randomized instance. The graph plots the number of shortest paths of a given length. 
The two distributions have similar qualitative features, in particular showing exponential decay and the 
same location of the maximum. However, they differ significantly in their tails, where the empirical graph 
shows a higher abundance of longer shortest-paths. 
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E. coli (RDB 5.5) 
Subgraph type 


# Empirical 


Randomization 


P-value 


(i) 


1 


7.877 ± 3.605 


0.0017 


(ii) 


3 


0.412 ± 0.640 


0.0087 


(iii) 


6 


4.089 ± 2.639 


0.2497 


(iv) 


1 


0.1072 ±0.319 


0.1042 


(v) 





0±0 


n.a. 


(vi) 


179 


81.55 ± 15.24 


< io- 4 



B. subtilis (DBTBS) 
Subgraph type 


# Empirical 


Randomization 


P-value 


(i) 


1 


1.2472 ± 1.0994 


0.2824 


(ii) 


2 


1.7739 ± 1.8284 


0.4327 


(iii) 


1 


0.2370 ± 0.4872 


0.2101 


(iv) 


1 


0.0755 ± 0.2642 


0.0755 


(v) 





0.0758 ± 0.2646 


0.0758 


(vi) 


55 


29.237 ± 10.672 


0.0242 



S. cerevisiae (Balaji) 
Subgraph type 


# Empirical 


Randomization 


P-value 


(i) 


7 


8.756 ± 3.991 


0.3074 


(ii) 


6 


2.373 ± 1.435 


0.0243 


(iii) 





0.209 ± 0.414 


< io- 4 


(iv) 





0±0 


n.a. 


(v) 


1 


0.199 ± 0.399 


0.1995 


(vi) 


16 


17.49 ±6.51 


0.4308 



Supplementary Table 1: Subgraphs compatible with AR duplications, containing crosstalks. The tables 
report the occurrence of different network subgraphs which can stem from duplications of nodes and edges 
and subsequent partial loss of regulatory interactions for the three transcription networks, compared to 
randomizations of the homology classes. These subgraphs are displayed in Fig. [3] 

26 



0.03 




Sum-Product of Homolog 
Homodimer-Heterodimer Occurrence 

Supplementary Figure 2: Heterodimer / homodimer TF duplication in yeast. The x axis reports the observ- 
able quantifying the co-occurrence of homodimers and heterodimers in the same homology class (the sum 
over classes of the product of number of heterdimers and homodimers within each class), for the empirical 
network of yeast (Balaji data-set), compared to randomizations (histogram). The histogram corresponds to 
the data shown in Table UlTI 
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